Fix callback not being called after activesock shutdown (#4864) #4878
nanangizz merged 3 commits into pjsip:master
Conversation
Force-pushed c60348f to e608796
This fixes the issue as described and reproduced in #4864. I am moderately confident that this is a necessary and sufficient fix; however, the whole architecture around the TLS transport is quite difficult to get my head around. There are many points where a
Pull request overview
Fixes a shutdown edge case in PJLIB's active socket write-completion path so higher layers still receive the send-completion callback, preventing reference leaks (e.g., callers that decrement `pjsip_tx_data` refs in `on_data_sent`).
Changes:
- Invoke `on_data_sent` when `SHUT_TX` is set in `ioqueue_on_write_complete()` (instead of returning silently).
Force-pushed 226b170 to af16a17
Actually, with further testing I encountered another case where the callback still isn't called. This requires more thorough troubleshooting. One major pain point is that the issue is only reproducible in high-load environments or with the ioqueue "fast track" commented out. Marking this as draft until I figure out a more thorough solution.
The leak happens when:
1. `send_buf_pending` is already in use;
2. `ssock_on_data_sent` calls `flush_circ_buf_output`;
3. `send_buf_pending` gets overwritten → the previous data in `send_buf_pending` is lost forever, its callback is never called, and therefore the reference to the tdata it contains is never decremented, preventing transport destruction.
Fixed a second leak. At this point I've lost all confidence that the original implementation has any understanding of the importance of not losing track of ioqueue keys. I'll do a thorough review next week.
The fast track remains enabled by default, but this is useful for troubleshooting. The ioqueue has two wildly different behaviors:
1. In the fast track (almost always used in low-load scenarios), sending a `write_op` always immediately returns `PJ_SUCCESS`. The semantics in this case are that the _caller_ is responsible for calling any relevant callbacks.
2. Outside the fast track (relevant in high-load scenarios, which unfortunately do not seem to be extensively covered by regression tests), sending a `write_op` returns `PJ_EPENDING` in the happy path, which changes the semantics: now the _ioqueue_ becomes responsible for the data owned by the `write_op` and for calling the relevant callbacks.
See pjsip#4864 and pjsip#4878 for why this is a tricky behavior that can easily be missed and cause bugs. Disabling the fast track allows for more easily testing the second case.
Force-pushed 4a54cd5 to b3440a7
Before the bugfixes in pjsip#4878, the test fails: `16:58:29.999 Test "concurrent_sends[i].sent" != 0 fails in activesock.c:469 (sent callback not called)`
Force-pushed b3440a7 to 47cead4
Force-pushed 47cead4 to 31f22f4
Force-pushed 31f22f4 to bc22897
Force-pushed d395b2b to bc60815
@azertyfun We've tried to address the issues mentioned earlier. Please let us know if you notice any further issues or have additional comments.
We deployed the earlier fix (which sends garbage over the wire, though that doesn't really matter in our use case). However, we had to roll it back because we unexpectedly encountered crashes. I am currently investigating the cause. I still think the fix is sound in principle, but it may be triggering other existing race conditions. In particular, this bit seems faulty: if the ioqueue is busy,
You're right. I asked the AI to check that area, and it found several issues:
1. Wrong `app_key` passed to `ssl_send`.
2. `PJ_EPENDING` not handled (double-send).
3. Missing callback on `PJ_SUCCESS` (tdata leak).
4. `send_buf_pending` not cancelled on error in `ssock_on_data_sent`.
Here is the draft patch (compile-tested only; I will test further next week).
After further investigation, the crash I described above was one of the outcomes, but the more common one was that we saw
Now it is clearly related to the
Thanks to the error message we know that
From there I can only speculate why there was a deadlock, but
The whole logic for SSL handshaking is spread out, has multiple redundancies, and is probably just as buggy as the rest of the code touched by this PR. Another thorough review is needed, but that's going to be quite a few additional man-hours, which I'm starting to run short on as other priorities are catching up after several weeks of working on this already. And anyway, how can we know when we've squashed the last critical bug? How can we verify our fix when the interfaces are so complex and behave in radically different ways under load vs. not, including in edge cases like SSL renegotiation under load?
Unfortunately, given that I caused a major incident by pushing a partial fix, we won't get approval for another attempt unless we have a complete explanation for all issues encountered and a way to reliably reproduce our race conditions and verify the correctness of any fixes. Perhaps the solution for us will be to wait until
…ip#4878)

When pj_ioqueue_unregister() is called with pending async writes, on_write_complete callbacks were silently skipped. Upper layers (e.g., SIP transport) rely on these callbacks to release resources such as tdata references, causing reference leaks under high load.

The fix drains pending write callbacks during key unregistration:
- write_cb_list (completed ops, deferred callback): invoked with op->written (success status), respecting write_callback_thread serialization.
- write_list (pending ops, never sent): invoked with -PJ_ECANCELLED.

Changes across all maintained ioqueue backends (epoll, kqueue, select, IOCP). Also includes:
- PJ_IOQUEUE_FAST_TRACK config to disable fast-track for testing
- activesock: forward on_data_sent callback when SHUT_TX is set
- Regression test for send callbacks on activesock close

Co-Authored-By: Nathan Monfils <nathanmonfils@gmail.com>
Co-Authored-By: Claude Code
Force-pushed bc60815 to aacb3bc
@azertyfun Thanks for the detailed crash report and analysis. You're right: the
We've removed the SSL change from this PR. The ioqueue-level fix (draining
The SSL
Could you try deploying this updated version (without the SSL change)? It should fix the tdata leak without triggering the renegotiation crash.
We already had a major customer-impacting incident, which unfortunately means that at this point I won't be able to get approval to push further fixes into production without very strong guarantees. We are now mitigating the TLS issues by staying on version 2.15.1 but reducing the number of registrations per TLS connection. We will need proper assurances (a refactor enabling thorough unit-test coverage of the SSL subsystem, and probably additional integration testing with Asterisk) before I will be allowed to push a fix through.
(To be clear: for the purposes of this PR I have no problem if it gets merged as-is; I just won't be able to verify that it fixes our specific production issues.)
- Add PJ_UNUSED_ARG for status/sent variables in pj_ioqueue_send(), pj_ioqueue_sendto(), and pj_ioqueue_accept() when fast-track is disabled, fixing MSVC C4101 warnings.
- Fix activesock UDP echo test: accept PJ_EPENDING as non-error from sendto(), since the async path always returns PJ_EPENDING when fast-track is disabled. Also fix the error message to print the actual send status instead of the recvfrom callback status.

Co-Authored-By: Claude Code
Replace the ring-buffer-based send mechanism (send_buf + single send_buf_pending slot) with per-op pool-allocated ssl_send_op_t using an embedded encrypted-data buffer, a free list with a configurable cap, and true memory release on discard. This eliminates the PJ_ENOMEM crash during renegotiation reported in #4878.

Key changes:
- New ssl_send_op_t with per-op pool for true memory release
- ssl_do_handshake_and_flush() wrapper unifying error handling across all 5 SSL backends (OpenSSL, GnuTLS, mbedTLS, Schannel, Darwin)
- Fix flush_delayed_send: pass correct app_key, handle PJ_EPENDING to prevent double-send, invoke callback on synchronous success
- Fix ssock_on_data_sent: check app_key (not send_key) for handshake/shutdown detection (was dead code before)
- Fix ssock_on_connect_complete: return via on_handshake_complete instead of bare PJ_ENOMEM from a pj_bool_t function
- Rename circ_buf_output/input to ssl_write_buf/ssl_read_buf for clarity (not always circular; OpenSSL uses a memory BIO)
- Add "Caller must hold write_mutex" documentation to ssl_write() in all 6 backends

Test improvements:
- send_load_test: 200 rapid sends with async callback verification
- large_msg_test: 64KB message (multi-TLS-record) echo verification
- close_pending_test: close socket with pending sends (no crash)
- bidir_test: bidirectional simultaneous send load
- mt_send_load_test: 3 worker threads + 3 concurrent clients

CI: add ioq-no-fast-track job (PJ_IOQUEUE_FAST_TRACK=0 + ASan) to force the async send path in activesock and ssl_sock tests.

Co-Authored-By: Claude Code
@azertyfun Thanks again for the detailed report and analysis on this issue. Sorry for the problems it caused in your production environment.
Building on the ioqueue fix merged here, we've done a broader SSL socket send-path refactor in #4909: replacing the ring-buffer mechanism with per-op pool allocation, fixing several related bugs found during the process, and adding stress tests that cover the high-load/concurrent scenarios you described. Unfortunately we don't currently have an Asterisk test setup, so we haven't been able to validate against that environment.
Would appreciate any feedback or suggestions, especially if there are specific patterns from your environment we should add to the test coverage.
Add a renegotiation option to send_load_test and bidir_test:
- renego_at parameter triggers pj_ssl_sock_renegotiate() mid-stream
- Supports both client-initiated and server-initiated renegotiation
- Handle PJ_EBUSY from sends during renegotiation
- Guard with #if for OpenSSL/GnuTLS only (TLS 1.2 required)

Known failure: sends after renegotiation return PJ_EPENDING/PJ_EBUSY but callbacks never fire; flush_delayed_send doesn't deliver them. This is the pre-existing bug reported in #4878.

Add concurrent_send_test (multi-threaded send on the same socket):
- 3 threads blast-send on the same SSL socket simultaneously
- Simulates production SIP where worker threads share a TLS transport
- Detects out-of-order TLS records (connection error) and data loss

Add PJ_RACE_ME(5) in ssl_send between ssl_write and flush to widen the race window for concurrent send testing.

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to actively probe for race conditions.

Co-Authored-By: Claude Code
Enhance close_pending_test to verify that on_data_sent callbacks fire for all pending sends when the socket is destroyed. Previously the test only checked for crashes, not callback delivery. This catches the #4878 scenario where app resources leak due to lost callbacks.

Add destroy_during_renego_test: triggers renegotiation mid-blast so that sends are delayed in write_pending, then immediately closes the socket. Verifies that error callbacks (-PJ_ECANCELLED) fire for all delayed sends during ssl_on_destroy.

Both tests track err_cb_cnt separately to confirm that the destroy handler (not normal completion) fires the callbacks.

Co-Authored-By: Claude Code
Add stress tests that exercise the SSL send queue, renegotiation, and destroy paths:
- send_load_test: blast sends and verify all complete with callbacks
- close_pending_test: close socket with pending sends, verify on_data_sent callbacks fire (catches #4878 resource leak)
- bidir_test: both sides send simultaneously, verify no data loss
- mt_send_load_test: multi-threaded send on separate sockets
- concurrent_send_test: 3 threads blast-send on the same SSL socket, detects TLS record corruption from data-mixing races
- Renegotiation variants of send_load and bidir tests: client and server initiated, verify delayed sends flush after completion
- destroy_during_renego_test: close socket while renegotiation is in progress, verify error callbacks fire for delayed sends

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to widen race windows for concurrent send testing.

Co-Authored-By: Claude Code
Add stress tests that exercise the SSL send queue, renegotiation, and destroy paths:
- send_load_test: blast sends and verify all complete with callbacks
- close_pending_test: close socket with pending sends, verify on_data_sent callbacks fire (catches #4878 resource leak)
- bidir_test: both sides send simultaneously, verify no data loss
- mt_send_load_test: multi-threaded send on separate sockets
- concurrent_send_test: 3 threads blast-send on the same SSL socket, detects TLS record corruption from data-mixing races
- Client renegotiation variants (OpenSSL only): send_load, bidir, destroy-during-renegotiation
- Server renegotiation variant (OpenSSL + GnuTLS): send_load

Add RUN_SUBTEST macro to log the failing sub-test name in output.

CI: enable PJ_RACE_ME in ioq-no-fast-track matrix jobs to widen race windows for concurrent send testing.

Co-Authored-By: Claude Code

This is critical to avoid a reference leak. The callback is responsible for calling `pjsip_tx_data_dec_ref()`.
Background: When `pj_ioqueue_unregister()` is called with pending async writes, `on_write_complete` callbacks are silently skipped. This causes tdata reference leaks, since upper layers rely on these callbacks to call `pjsip_tx_data_dec_ref()`. The bug only manifests when the ioqueue fast track fails (high load, full send buffer), making it nearly invisible in normal testing. It exists with or without `PJ_IOQUEUE_CALLBACK_NO_LOCK`.

Approach: Drain pending write callbacks during key unregistration, handling two lists with different semantics: `write_cb_list` (completed ops, deferred callback) is invoked with success status, respecting `write_callback_thread` serialization; then `write_list` (pending, unsent ops) is invoked with `-PJ_ECANCELLED`.

Maintainer rework addressing review feedback:
- `grp_lock` ref counting.
- `-PJ_ECANCELLED`; `write_callback_thread` properly cleared in the closing path.
- `on_data_sent` when `SHUT_TX` is set (not `on_data_read`).
- `PJ_IOQUEUE_FAST_TRACK` config to disable the fast track for testing the async path.
- `SO_SNDBUF` shrink to force the async path.

Note: The SSL socket `send_buf_pending` fixes (`flush_delayed_send`, `ssock_on_data_sent` error path) have been deferred to a separate PR. The SSL send path has pre-existing issues that require a more thorough refactor; see the discussion in the review comments. The ioqueue-level fix in this PR is sufficient to address the original #4864 tdata leak.